Assigning a Quality Measurement to Matching Records from Heterogeneous Legacy Databases: A Practical Experience
Abstract
In this paper we present the results of a practical exercise in matching records from two different legacy databases. This exercise is a portion of our work for the US Defense Logistics Agency (DLA). It addresses the problem of identifying parts in the database of DLA-managed items that match parts available from a commercial online catalog. A method of automatically generating approximate matching records is described, and a procedure is presented for assigning scores to these matches. Finally, a statistical quality sampling standard is applied to these match results to generate a quality measure for each score. This gives DLA an objective means to determine whether or not a given part can be ordered from the commercial catalog. The methodology can be applied to any data manipulation process, such as data cleaning or data mining, where the process quality can be independently verified through statistical sampling.

1) Introduction.

As organizations become more web aware, they increasingly need to compare their own internal data to data available from outside web sources. Consider the case of ordering parts from an online catalog. The organization placing this order needs some assurance that the part it requires, as specified by its internal data, is the same part that is ordered. This necessitates matching the internal data for the part to the external data presented by the catalog source. It also requires some quantitative measure of how good the match is.

Such a situation arose as part of our work for the US Defense Logistics Agency (DLA). DLA wanted to determine which of the parts it managed could be ordered from commercial online catalogs. We developed a method of automatically matching DLA's internal part data to part data from online catalogs. This generated possible part matches, which were scored on how well they matched. Each score level was then assigned a quality measure using statistical sampling techniques outlined in MIL STD 105 [3], a well-known quality measurement standard. By sampling matches made at various score levels and manually examining the samples to judge whether they were indeed good matches, we were able to assign an Acceptable Quality Level (AQL) to each score. The crucial point here is that match scores were generated by the automatic matching process, while AQLs for the scores were developed from a manual inspection process using parts of the data not accounted for in the automatic process. DLA can now use these AQLs to determine whether a part can be ordered from an online catalog, based on the risk associated with getting and using a wrong part. This is similar to ordering parts from a manufacturer based on the AQL of the manufacturing process.

The following sections of the paper describe the details of this data matching process. Section 2 outlines the matching methodology, the process of automatically generating matches. In Section 3 we describe the way matches were scored; scores were based on the success of various components of the matching process. Section 4 presents the quality validation of the different score levels, which results from the sampling and inspection of matches and assigns an AQL to each score level. Finally, in Section 5 we present our conclusions and plans for future work.

2) The Record Matching Methodology.
The problem of determining which parts managed by DLA are available from commercial online catalogs exists because DLA and the online catalogs maintain different data about parts. The differences in these legacy databases make it difficult to directly join records from the two sources to establish matches. Data about parts from DLA contains the following fields:

NSN – National Stock Number (unique Federal identification number)
Cage Code – ID for the registered source of the part
Manufacturer Part Number – assigned by the manufacturer
Nomenclature – official name of the part (assigned by DLA)
Vendor Name – name of the entity the Cage Code is assigned to
Vendor Address – address of the entity

The specific exercise attempted to match DLA parts, as described by the above records, to parts from Newark Electronics, an online vendor of electronic parts. Data records about parts from Newark Electronics took the following form:

Commercial Part Number – assigned by the manufacturer
Part Description – free text describing the part
Catalog Number – unique ID assigned by Newark Electronics
Vendor ID – number assigned by Newark to identify the source of the part
Vendor Name and Address – Newark's source for the part

The only field that is identical in both the DLA and Newark Electronics data is the Manufacturer Part Number/Commercial Part Number. This number is assigned by the manufacturer and is identical in both databases. All other field values are assigned by the respective data owners and have no guarantee of matching. Even vendor names and addresses vary significantly between the two databases: Newark Electronics might use 3M as the manufacturer's name while DLA uses Minnesota Mining and Manufacturing, and DLA may contain a street address where Newark Electronics has a Post Office box.

However, joining the two databases on the part number field does not ensure that parts will match. While manufacturer part numbers are unique for parts from a given manufacturer, these numbers are not unique across manufacturers: the same number can be assigned to many parts from different manufacturers. Matching the part number is necessary to establish a part equivalence, but it is not sufficient.

The approach we take to establish matches comprises three steps. First, we join the DLA and Newark Electronics databases on the part number field. Any match produced from this join is a possibly equivalent part. The matches so obtained produce DLA Cage Code/Newark Vendor ID pairs. Second, we standardize the name and address data from both DLA and Newark Electronics. This standardization process produces records that contain the following fields for both the DLA and Newark Electronics data:

Company Name
Alternate Name
Street Number
Street Name
PO Box
City
State
Zip Code

For DLA, this record is associated with a Cage Code and, for Newark Electronics, with a Vendor ID. The Alternate Name field is only populated for DLA records, since DLA often lists division and subsidiary names along with the company name. The third step in generating matches is to compare the above standardized records for a given Cage Code/Vendor ID pair and score how well the various fields match. Details about scoring the matches are presented in Section 3. The standardization of names and addresses is described in [1]. Both the standardization and the scoring of matches are implemented in XSB [2], an efficient deductive logic engine well suited for natural language analysis.
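To make the first step concrete, the sketch below shows one way the part-number join and candidate-pair generation could look. This is an illustrative Python sketch, not the XSB implementation used in the exercise; the record layouts and field names (part_number, cage_code, vendor_id) are assumptions based on the field descriptions above.

```python
from collections import defaultdict

# Assumed record layouts, based on the field lists above:
#   dla_parts:    dicts with keys "nsn", "cage_code", "part_number"
#   newark_parts: dicts with keys "catalog_number", "vendor_id", "part_number"

def candidate_pairs(dla_parts, newark_parts):
    """Step 1: join the two databases on the manufacturer part number.

    Returns a dict mapping (cage_code, vendor_id) pairs to the number of
    distinct part numbers that produced the pair; that count feeds the
    frequency component of the score described in Section 3.
    """
    # Index Newark records by part number so the join is a single pass.
    newark_by_part = defaultdict(list)
    for rec in newark_parts:
        newark_by_part[rec["part_number"]].append(rec)

    pair_parts = defaultdict(set)
    for dla_rec in dla_parts:
        for nwk_rec in newark_by_part.get(dla_rec["part_number"], []):
            # A join hit is only a *possibly* equivalent part; the name and
            # address comparison in step 3 decides how plausible it is.
            pair = (dla_rec["cage_code"], nwk_rec["vendor_id"])
            pair_parts[pair].add(dla_rec["part_number"])

    return {pair: len(parts) for pair, parts in pair_parts.items()}
```

Each resulting Cage Code/Vendor ID pair is then compared on the standardized name and address records, which is where the scoring described next comes in.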
3) Scoring the Matches.

The critical issue in determining whether a part from DLA is equivalent to a part from Newark Electronics is how well the name and address records for a Cage Code/Vendor ID pair match. We generate a score for this match, which assigns a quantitative value to the closeness of the match. The score is the sum of the following components:

Cage Code/Vendor ID frequency – a measure of how many part numbers produce the given Cage Code/Vendor ID pair. This component is scored from 1 to 4 on an approximately logarithmic scale.
Name – indicates how well the names match. Names are represented as strings of words, and the score is based on the length of the longest common subsequence of the two name strings. This score ranges from 0 to 5 for matching company name to company name and 0 to 5 for matching company name to alternate name.
Street Number – scores 0 when the street numbers do not match, 1 if they match but the zip codes do not, and 2 if both the street number and the zip code match.
PO Box – scores PO Box matches from 0 to 2 in the same manner as street numbers.
Zip Code – scores 0 if the zip codes do not match, 1 if the 5-digit zip codes match, and 2 if the 9-digit zip codes match.

A total score for a Cage Code/Vendor ID match is obtained by summing the component scores, with the caveat that the total is 0 unless some component besides the Cage Code/Vendor ID frequency produces a non-zero result. The highest possible score for a match is 20. The best score we achieved was 14, which is a good indication of how different name and address data can be between the two databases.

There are approximately 150,000 items in the Newark Electronics catalog data and approximately 4,000,000 managed NSNs in the DLA database. The join on part numbers produced approximately 249,000 possible matches. Of these, only about 30,000 had scores higher than 0. The distribution of scores is presented in Figure 1; 80% of these matches scored 5 or better. Our initial assumption was that these indicated good matches, but we needed some method of quantifying how good. In the next section we describe using statistical quality sampling techniques to assign a quality to each score level.

4) Validating Scores.

The idea behind assigning a quality measure to match scores is the same as that behind statistical quality control. One such measure, described in MIL STD 105 [3], is the Acceptable Quality Level (AQL). AQL is defined as the number of defects per 100 items produced. In the context of producing matches between DLA parts and Newark Electronics parts, the AQL indicates the number of false matches per 100 matches generated. In this case a match is either correct or incorrect, so there can be at most one defect per match. The procedure defined in MIL STD 105 is to choose a random sample from a production run based on the size of the run. This sample is inspected for defects. To achieve a certain AQL, the sample can have at most a certain number of defects.
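To make the scoring and sampling concrete, here is a minimal Python sketch of how the component scores of Section 3 could be combined and how matches at one score level could be sampled for manual inspection. It is illustrative only, not the XSB implementation used in the exercise: the field names, the mapping from longest-common-subsequence length to the 0-5 name score, and the exact logarithmic scaling of the frequency component are assumptions, since the paper does not spell them out, and the MIL STD 105 sample sizes and acceptance numbers come from the standard's tables rather than being computed here.

```python
import math
import random

def lcs_length(a_words, b_words):
    """Length of the longest common subsequence of two lists of words."""
    table = [[0] * (len(b_words) + 1) for _ in range(len(a_words) + 1)]
    for i, a in enumerate(a_words, 1):
        for j, b in enumerate(b_words, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if a == b
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def name_score(name_a, name_b):
    """Assumed 0-5 name component: cap the word-level LCS length at 5."""
    if not name_a or not name_b:
        return 0
    return min(5, lcs_length(name_a.split(), name_b.split()))

def match_score(pair_frequency, dla, nwk):
    """Combine the component scores of Section 3 for one Cage Code/Vendor ID pair.

    dla and nwk are standardized name/address dicts with assumed keys such as
    "company_name", "alternate_name", "street_number", "po_box", "zip5", "zip9".
    """
    # Frequency component: 1-4, approximately logarithmic in the pair count.
    freq = min(4, 1 + int(math.log10(max(pair_frequency, 1))))

    # Name: company-to-company plus DLA alternate name to Newark company name.
    name = (name_score(dla["company_name"], nwk["company_name"])
            + name_score(dla.get("alternate_name", ""), nwk["company_name"]))

    zips_match = bool(dla["zip5"]) and dla["zip5"] == nwk["zip5"]

    street = 0
    if dla["street_number"] and dla["street_number"] == nwk["street_number"]:
        street = 2 if zips_match else 1

    po_box = 0
    if dla["po_box"] and dla["po_box"] == nwk["po_box"]:
        po_box = 2 if zips_match else 1

    zip_score = 0
    if zips_match:
        zip_score = 2 if dla.get("zip9") and dla["zip9"] == nwk.get("zip9") else 1

    # The total is 0 unless some component besides the frequency is non-zero.
    rest = name + street + po_box + zip_score
    return freq + rest if rest > 0 else 0

def sample_for_inspection(matches_at_level, sample_size):
    """Draw a random sample of matches at one score level for manual inspection.

    The sample size and the number of allowable defects for a given AQL come
    from the MIL STD 105 tables and are not reproduced here.
    """
    return random.sample(matches_at_level, min(sample_size, len(matches_at_level)))
```

With functions like these, the candidate pairs from the join can be scored, binned by score level, sampled, and inspected to estimate the number of false matches per 100 at each level, which is the AQL assigned to that score.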
Similar resources
Reviewing the Status and Experience of Outsourcing Policy in Healthcare: A review study
Background: Outsourcing is a method and tool for delivering services to increase effectiveness and efficiency. Regarding the outsourcing policies, identifying experiences and outsourcing records is essential. In this regard, the present study aimed at examining the status and experience of outsourcing policy in the health sector as a review study. Methods: The present systematic review was car...
The legacy of Islamic world in modern medicine and science
The legacy of the Islamic world in medicine and natural science is the legacy of Greece, increased by many additions, mostly practical. Rhazes, the Iranian, was a talented clinical observer, but not a Harvey. Abd al-Latif, the Arab, was a diligent seeker in anatomy, but in no way to be compared to Vesalius. The Muslims possessed excellent translations of the works of the Hippocratic Corpus and ...
Matching Scores of System Relevance and User-Oriented Relevance in SID, ISC and Google Scholar
Background and Aim: The main aim of Information storage and retrieval systems is keeping and retrieving the related information means providing the related documents with users’ needs or requests. This study aimed to answer this question that how much are the system relevance and User- Oriented relevance are matched in SID, SCI and Google Scholar databases. Method: In this study 15 keywords of ...
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...